Internet Info 1997 December

Internet Message Format | 1997-02-19 | 7KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id FAA08944 for urn-ietf-out; Thu, 24 Oct 1996 05:10:14 -0400
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id FAA08939
    for <urn-ietf@services.bunyip.com>; Thu, 24 Oct 1996 05:10:12 -0400
Received: from nic.cafax.se by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA24239
    (mail destined for urn-ietf@services.bunyip.com);
    Thu, 24 Oct 96 05:09:38 -0400
Received: from [192.71.220.137] (zap.swip.net [192.71.220.137])
    by nic.cafax.se (8.8.2/8.8.2) with ESMTP id LAA00967;
    Thu, 24 Oct 1996 11:09:34 +0200 (MET DST)
X-Sender: m-3329@mailbox.swip.net
Message-Id: <v03007802ae94cf233d55@[192.71.220.137]>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Date: Thu, 24 Oct 1996 11:08:19 +0200
To: urn-ietf@bunyip.com
From: Patrik Faltstrom <paf@swip.net>
Subject: [URN] UNICODE or not UNICODE?
Cc: splinter@bunyip.com
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Patrik Faltstrom <paf@swip.net>
Errors-To: owner-urn-ietf@bunyip.com

Terry wrote

> - why care? the NSS is supposed to be opaque
> - does this imply that a) NSSs should be formed originally
>   in Unicode, or that b) NSSs in other coded character sets
>   must be translated/transliterated into Unicode in forming
>   URNs, or c) something else?

The problem with not defining a character set is that it becomes
impossible to compare two URNs. We need some ability to do
comparisons, and the only reasonable way of doing that is to use
_one_ character set.

To answer the second question: in some cases you have to have a URN
in a different character set, for example in a client which does not
use UNICODE. You then have to do translation.

What we did in Digger, the Whois++ server Bunyip has, was to use the
rules for comparison (decomposition + sorting) and the translation
rules that the UNICODE consortium has defined.

For each character, the UNICODE tables include one kind of
equivalence: the rule for decomposing that code point into more than
one other code point.

One example is the letter 'Ä'. In UNICODE, this character is defined
as:

00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;LATIN CAPITAL LETTER A DIAERESIS;;;00E4;

One can see here that the code point U+00C4 is equivalent to U+0041
followed by U+0308. One can also see that the lower case version of
this character is U+00E4 (among other things).

When comparing two strings, where one of them includes "U+00C4" and
the other the sequence "U+0041"+"U+0308", these strings should be
considered equal in the sense of the UNICODE spec. Note that I am not
talking about ISO-10646 here, as I am not at all familiar with which
parts of this are included in the 10646 spec. This is UNICODE 2.0 we
are talking about!

When comparing strings, one has to do decomposition, and the
definition of decomposition is:

>Decomposition.
> (1) the process of separating or analyzing a text element into
> component units. These component units may not
> have any functional status, but may be simply formal units, i.e.,
> abstract shapes; (2) the process of replacing a
> code element with multiple code elements, which, together,
> represent the original code element in some
> manner, e.g., the shapes associated with the resulting
> code elements may combine to form the shape associated
> with the original code element.

This means that before comparing the strings that include "U+00C4"
and "U+0041"+"U+0308", every code point has to be decomposed as far
as possible. I.e. "U+00C4" has to be changed to "U+0041"+"U+0308",
and these have to be sorted (i.e. one can know from the code tables
that "U+0041" is to come before "U+0308" in a composed character).
THEN we do the comparison code point by code point.

If we want to do case-insensitive matching, we use the information
from the UNICODE consortium about which character is the lower case
version.

Terry continues:

>When I translate that 8859-6 name into Unicode I have more
>than one possible outcome (depending on whether I keep it
>simple, using 0621--064A or use Unicode code points that
>include diacritical marks

If one has a look at the UNICODE tables, one can see that, for
example, U+0621 is the end result when decomposing other glyphs,
such as

FE80;ARABIC LETTER HAMZA ISOLATED FORM;Lo;0;R;<isolated> 0621;;;;N;GLYPH FOR ISOLATE ARABIC HAMZAH;;;;

which is the special glyph used in isolated form. You see that U+FE80
should be equivalent to U+0621, so when using the decomposition rules
before comparing strings, U+FE80 and U+0621 are considered
equivalent. This is exactly why decomposition is needed.

>or use Unicode code points that
>indicate glyph variants of a letter, such as 06AA, "Arabic
>Letter Swash Kaf," which is lexically the same as 0643,
>"Arabic Letter Kaf," or specify some ligatures).

I don't know Arabic, but I am just following the rules that the
UNICODE consortium has set up, and according to these rules, U+06AA
and U+0643 are not equivalent characters when comparing:

06AA;ARABIC LETTER SWASH KAF;Lo;0;R;;;;;N;ARABIC LETTER SWASH CAF;;;;
0643;ARABIC LETTER KAF;Lo;0;R;;;;;N;ARABIC LETTER CAF;;;;

I am not arguing whether this decision by the UNICODE consortium was
correct or not, but _someone_ that they trusted must have told them
these are different characters.

>Say
>I can have outcomes A, B, and C, all of them legitimate
>representations of my 8859-6 name in Unicode. Are
>urn:mynamespace:A, urn:mynamespace:B, and urn:mynamespace:C
>equivalent? Once the URN is passed to the resolver, is
>it contemplated that the resolver will translate the URN's
>NSS back into 8859-6?

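The decompose, sort, then compare procedure described above can be
sketched in a few lines of Python. This is only an illustration and
not the Digger code. It assumes Python's standard unicodedata module,
whose tables come from a much later UNICODE version than 2.0, and it
assumes that normalization form NFKD is an acceptable stand-in for
"maximal decomposition", since NFKD also applies tagged mappings such
as <isolated>. The function names comparison_key and equivalent are
made up for this example.

```python
import unicodedata


def comparison_key(text, ignore_case=False):
    """Return the fully decomposed form of text, used as a comparison key.

    NFKD applies canonical and compatibility decompositions as far as
    possible and puts combining marks into canonical order (the
    "sorting" step).
    """
    key = unicodedata.normalize("NFKD", text)
    if ignore_case:
        # Fold case using the UNICODE case data, then decompose again
        # because folding can produce characters that decompose further.
        key = unicodedata.normalize("NFKD", key.casefold())
    return key


def equivalent(a, b, ignore_case=False):
    """Compare two strings code point by code point after decomposition."""
    return comparison_key(a, ignore_case) == comparison_key(b, ignore_case)


# The raw decomposition fields from the character tables:
print(unicodedata.decomposition("\u00C4"))  # 0041 0308
print(unicodedata.decomposition("\uFE80"))  # <isolated> 0621

print(equivalent("\u00C4", "A\u0308"))                    # True
print(equivalent("\uFE80", "\u0621"))                     # True
print(equivalent("\u06AA", "\u0643"))                     # False
print(equivalent("\u00C4", "\u00E4", ignore_case=True))   # True
```

Under these assumptions, the Swash Kaf case comes out as "not
equivalent" simply because neither U+06AA nor U+0643 has a
decomposition entry in the tables.
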
When mapping between ISO-8859-6 and UNICODE, the consortium has a
table which one should use. In that table I can read that U+0643 is
0xE3 in ISO-8859-6, but that U+06AA is missing from ISO-8859-6.

Because of this, a client has to know that the URN is specified in
UNICODE, and do the appropriate mappings. That is, if a URN has
U+06AA as one of its characters, that URN cannot be entered directly
on a client working in ISO-8859-6 (nor on a client working in
ISO-8859-1 or US-ASCII). In this case %encoding can be used.

   Patrik

--------------------------------------------------------------------
Senior Researcher, Tele2/SwipNet
Stockholm, Sweden
paf@swip.net
Phone: +46-8-56264000
urn:inet:urn.paf.se

In theory, there's no difference between theory and practice,
but in practice, there is.
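
As a footnote to the mapping paragraph above, here is a small Python
sketch of the client-side check. It is only an illustration under
some assumptions: Python's built-in "iso8859_6" codec stands in for
the consortium's mapping table, the helper name nss_for_client is
made up, and the octets that get %encoded are the UTF-8 form of the
character, which is a choice made for this example and not something
the message above specifies.

```python
from urllib.parse import quote


def nss_for_client(nss, client_charset):
    """Rewrite an NSS so it can be typed on a client using client_charset.

    Characters the client's character set can represent are kept as
    they are. Any other character is %encoded (assumption: as UTF-8
    octets).
    """
    out = []
    for ch in nss:
        try:
            ch.encode(client_charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(quote(ch, safe=""))
    return "".join(out)


# U+0643 exists in ISO-8859-6, as octet 0xE3.
print("\u0643".encode("iso8859_6"))           # b'\xe3'

# U+06AA does not exist in ISO-8859-6, so it is %encoded.
print(nss_for_client("\u06AA", "iso8859_6"))  # %DA%AA
```

The same call with "latin-1" or "ascii" as the client character set
%encodes both U+0643 and U+06AA, since neither set contains them.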